ABSTRACT 

Item response theory (IRT) has been used extensively 
to study differential item functioning (dif) and to identify 
potentially biased items. The use of IRT for diagnostic purposes is 
less prevalent and has received comparatively less attention. This 
study addressed differential objective function (dof) to identify 
potentially biased content units. IRT was used to estimate person 
abilities and item difficulties, which were used to compute residual 
objective scores. Residual objective scores were analyzed with 
analysis of variance using the independent variables gender and 
ethnicity. Data were from mathematics subtests from the 1992 
Connecticut Mastery Test census administration of eighth graders and 
its database of approximately 32,000 Connecticut eighth graders. The 
examples illustrate how dof outcomes can be used to identify 
potentially biased content units, to provide diagnostic information 
at the content level, and to construct profiles of content-based 
performance for different demographic subgroups. Ten figures and two 
tables present analysis results. Two appendixes present dif 
statistics by demographic subgroup and item-level statistics for dof 
objectives in four tables. (Contains 11 references.) (Author/SLD) 

RATIONALE 

Applications of item response theoretic (IRT) methods have enhanced the process 
of test development and test construction (Hambleton, 1989), revolutionized 
computer-adaptive testing technology, and facilitated test equating procedures. The item function 
is defined by simultaneously estimating person and item parameters, and is expected to 
be comparable between matched-ability groups that differ on characteristics independent 
of ability. IRT-based methods present significant contributions to the investigation of 
differential item functioning (dif) and potential item bias. (See, for example, reviews by 
Ironson, 1983, and Shepard, Camilli, & Averill, 1981.) IRT-based methods have 
additionally been used to test the sufficiency of model-data fit and its relationship to 
potential item bias (Linn & Harnisch, 1981; Wright, Mead, & Draba, 1976). 

Despite the widespread use of IRT in technical areas of test development, its 
application for curricular diagnosis and content analysis is less prevalent. Popular dif 
methods use discrete items as the unit of analysis. Although those types of analyses 
serve multiple purposes, interpretations about the test content are not inherently tied to 
those methods. Identifying potential bias through differential function at any level - 
item, objective, or other content-based units - is ultimately a function of a substantive 
content review. Traditional methods typically address only one factor at a time and 
ignore interaction effects from multiple dif factors like gender and ethnicity. Ignoring 
dif interaction effects can result in misleading interpretations about dif main effects. 

Tang (1994) proposed an IRT-ANOVA method which addresses simultaneous dif 
analysis for multiple levels and multiple factors. IRT is used to estimate person ability 
and item difficulty parameters. Residual scores, free from the effects of person ability 
and item difficulty, are computed. Differences in residual scores between different 
demographic groups, defined by different levels of dif factors, are then tested with 
analysis of variance. Any significant differences in the demographic groups' mean 
residual scores may be an indication of potential bias. 

The current study extends the IRT-ANOVA method's unit of analysis from the 
discrete item level to the content-based unit. The empirical analysis of content-based 
units contextualizes the statistical significance of discrete dif items. This content-based 
extension presents several advantages over traditional dif methods: 

i. the analyses are performed on content-based units; 

ii. the method can simultaneously address multiple levels and multiple 
factors; 

iii. interaction effects can be studied while controlling for confounding 
variables; 

iv. the outcomes lend themselves readily to content-based interpretations; and 

v. content-based interpretations are more amenable to diagnostic applications. 

The content units may be defined by curricular objectives, content domains, or other 
substantive units which are used to define test content. The content unit in this study 
is the curricular objective, and its analysis is referred to as differential objective function. 

Differential objective function (dof) occurs when objectives function differently for 
particular subgroups of examinees irrespective of underlying ability. The presence of 
dof may be attributed to differences in opportunity to learn (Lehman, 1986; Muthen, Kao, 
& Burstein, 1991), in instructional bias (Linn & Harnisch, 1981), or in other curricular 
factors. Lower levels of performance may be attributed to differences in instructional 
delivery and in opportunity to learn. Given the tenability of model assumptions, 
differences in item performance between matched-ability groups are indicative of dif. 
Dof is more likely than item-level dif to yield content-based explanations about the 
observed differences between matched-ability groups. Outcomes at the objective level 
can provide collateral information which otherwise remains untapped from outcomes 
of discrete items alone. 

The results from this study illustrate how dof can inform interpretations of item 
analysis: It augments dif data, contextualizes the significance of item statistics, and 
provides diagnostic information at the objective level. 

METHOD 

SAMPLE 

The current study is a secondary analysis of mathematics subtest data from the 
1992 Connecticut Mastery Test census administration of eighth-graders in Connecticut 
public schools. The mathematics subtest consisted of 144 dichotomously-scored 
multiple-choice items. These items measured mathematics performance on 36 curricular 
objectives, each comprising four items. 

Two dichotomized student background variables - gender (Female/Male) and 
ethnicity (Black/White) - were the dof factors and formed the sampling strata for the 
study. From the database of approximately 32,000 Connecticut eighth-graders, item 
responses and demographic data from 400 examinees were randomly sampled from each 
(gender x ethnicity) demographic stratum to yield a total sample size of 1600. 
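
The stratified draw described above can be sketched in Python as follows. This is a minimal illustration, not the study's actual sampling code: the record layout (dictionaries with `gender` and `ethnicity` keys), the function name `stratified_sample`, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_stratum=400, seed=0):
    """Draw an equal-size simple random sample from each (gender, ethnicity) stratum.

    `records` is a list of dicts with hypothetical keys "gender" and "ethnicity".
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["gender"], rec["ethnicity"])].append(rec)

    sample = []
    for key in sorted(strata):
        # random.sample draws without replacement and raises ValueError
        # if a stratum holds fewer than per_stratum records.
        sample.extend(rng.sample(strata[key], per_stratum))
    return sample
```

With four strata and 400 examinees per stratum, the returned list has 1600 records, matching the total sample size reported above.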

LIMITATIONS OF THE STUDY 

This study is a secondary analysis of an existing data set which does not include 
information about methods of instructional delivery, opportunity to learn, or 
instructional bias. The current analyses exclude attempts to validate the interpretation 
of dof as a function of any of these factors. The methods described in this study are 
reported as part of developmental work in an area which warrants further consideration 
and continued research. 

PROCEDURE 

At the objective level, expected performance was modeled as a function of 
examinee ability and difficulty of the objective. Residual objective scores are a function 
of item scores adjusted for person ability and item difficulty, and reflect the difference 
between the expected and observed objective scores. They are expected to be random 
with a mean of 0. A positive (or negative) residual implies that an examinee's score is 
higher (or lower) than expected. Consistently high (or low) residuals for a subgroup 
imply that the objective favors (or disfavors) the subgroup. 

The procedure applies a one-parameter logistic IRT model to dichotomously-scored 
items. Item responses are assumed essentially unidimensional and locally 
independent within and across objectives. The initial steps at the individual examinee 
level for person n (1, ..., N) and item i (1, ..., I) nested within objective j (1, ..., J) are: 

Step 1. Calibrate the data for the intact group. Obtain estimates of person 
ability ($B_n$) and item difficulty ($D_i$). 

Step 2. Use the estimates obtained in Step 1 to compute person n's expected 
item i score, 

$$E_{ni} = \frac{\exp(B_n - D_i)}{1 + \exp(B_n - D_i)}.$$

The observed item i score for person n is $X_{ni}$. For dichotomously-scored 
items, $X_{ni} = 1$ if correct, 0 otherwise. 

Step 3. Compute person n's expected objective score by adding the expected 
item scores nested within objective j: 

$$E_{nj} = \sum_{i=1}^{I} E_{ni(j)}.$$

The observed objective score for person n is the sum of the item scores 
nested within objective j: 

$$X_{nj} = \sum_{i=1}^{I} X_{ni(j)}.$$

Step 4. Compute person n's residual objective score, $R_{nj} = X_{nj} - E_{nj}$. 
$R_{nj}$ is the difference between the observed and expected objective 
scores, and it reflects the magnitude of dof for person n on objective j. 

Step 5. Apply analysis of variance on the $R_{nj}$'s as the dependent variable 
and dof factors (gender and ethnicity) as the independent variables. 
The generalized linear model is: 

$$R_j = X\beta_j + \varepsilon_j$$

where $R_j$ = [N x 1] vector of N person residuals, $R_{nj}$, for objective j; 
$X$ = [N x G] "design" matrix of N persons' values on each independent dof 
variable in the model; 
$\beta_j$ = [G x 1] vector of regression coefficients for objective j; and 
$\varepsilon_j$ = [N x 1] vector of N persons' error terms for objective j. 
In this study, $\beta_j$ takes on the form 
$[\beta_0 \;\; \beta_{\text{ethnic}} \;\; \beta_{\text{gender}} \;\; \beta_{\text{ethnic} \times \text{gender}}]'$. 

Step 6. Compute the residual mean objective scores for mutually exclusive 
demographic subgroups, defined by the levels of the dof factors. 

The residual mean objective score reflects the magnitude of dof for the 
demographic subgroup. For example, a residual mean objective score of 0.15 for a 
subgroup indicates that the group as a whole performed better than expected by 0.15 
objective score points, given the group's ability level and the difficulty of the objective. 
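
The residual computations in Steps 2-4 and the subgroup means in Step 6 can be sketched in Python as follows. This is an illustrative sketch only: it assumes the ability and difficulty estimates from Step 1 are already available from a separate Rasch calibration, and the function names and data layout are hypothetical rather than taken from the study.

```python
import math
from collections import defaultdict


def expected_item_score(ability, difficulty):
    """Step 2: one-parameter logistic probability of a correct response.

    Written as 1 / (1 + exp(-(B - D))), which is algebraically the same as
    exp(B - D) / (1 + exp(B - D)).
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def residual_objective_score(ability, difficulties, responses):
    """Steps 3-4: observed minus expected objective score for one person.

    `difficulties` and `responses` cover the items nested within one objective;
    responses are scored 1 (correct) or 0 (incorrect).
    """
    expected = sum(expected_item_score(ability, d) for d in difficulties)
    observed = sum(responses)
    return observed - expected


def subgroup_residual_means(residuals, groups):
    """Step 6: mean residual objective score for each demographic subgroup."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r, g in zip(residuals, groups):
        totals[g] += r
        counts[g] += 1
    return {g: totals[g] / counts[g] for g in totals}
```

For example, an examinee whose ability equals the difficulty of all four items in an objective has an expected objective score of 2.0, so answering three items correctly gives a residual of 1.0.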



RESULTS 

Residual objective scores were modeled via general linear models, with dof factors 
gender and ethnicity as independent variables. Dof main effects and two-way 
interactions were tested for significance using the univariate F-ratio as the dof test 
statistic. The magnitude of residual mean difference was used as an additional criterion 
for significant dof. Appendix A presents residual mean objective scores and magnitudes 
of residual mean difference by main effects gender and ethnicity, two-way (gender x 
ethnicity) interactions, their univariate F-ratios, and corresponding p-levels of 
significance. Univariate F-ratios were computed separately for each objective. 
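
The univariate F-ratios for one objective can be sketched with the textbook formulas for a balanced two-factor design. The code below is a minimal illustration that assumes equal cell sizes (as in the 400-per-stratum sample); the function name and input layout are hypothetical, and it is not claimed to reproduce the software used in the study. It returns F-ratios only; a p-value would additionally need the F distribution (for instance `scipy.stats.f.sf`, if SciPy is available).

```python
from statistics import mean


def two_way_anova_f(cells):
    """F-ratios for a balanced two-factor design on residual objective scores.

    `cells` maps (ethnicity, gender) -> list of residuals, same length per cell.
    """
    eth_levels = sorted({e for e, _ in cells})
    gen_levels = sorted({g for _, g in cells})
    n = len(next(iter(cells.values())))
    assert all(len(v) == n for v in cells.values()), "design must be balanced"

    cell_mean = {k: mean(v) for k, v in cells.items()}
    # With equal cell sizes the grand mean equals the mean of the cell means.
    grand = mean(cell_mean.values())
    eth_mean = {e: mean(cell_mean[(e, g)] for g in gen_levels) for e in eth_levels}
    gen_mean = {g: mean(cell_mean[(e, g)] for e in eth_levels) for g in gen_levels}

    ss_eth = n * len(gen_levels) * sum((eth_mean[e] - grand) ** 2 for e in eth_levels)
    ss_gen = n * len(eth_levels) * sum((gen_mean[g] - grand) ** 2 for g in gen_levels)
    ss_int = n * sum(
        (cell_mean[(e, g)] - eth_mean[e] - gen_mean[g] + grand) ** 2
        for e in eth_levels
        for g in gen_levels
    )
    ss_err = sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v)

    df_eth = len(eth_levels) - 1
    df_gen = len(gen_levels) - 1
    df_int = df_eth * df_gen
    df_err = len(cells) * (n - 1)
    ms_err = ss_err / df_err

    return {
        "ethnicity": (ss_eth / df_eth) / ms_err,
        "gender": (ss_gen / df_gen) / ms_err,
        "interaction": (ss_int / df_int) / ms_err,
        "df_error": df_err,
    }
```

Running this once per objective corresponds to computing the F-ratios separately for each of the 36 objectives.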

Significant dof was detected on 10 of the 36 objectives for main effects and 2-way 
interactions at the α = 0.01 level. For main effect dof, an additional criterion of a difference 
in group residual mean objective scores greater than or equal to 0.15 was applied. These 
results are summarized in Table 1. 

Table 1 

Summary of Significant Dof 

Dof Effect                               Number of Objectives 

Main Effect: Ethnicity*                  4 
Main Effect: Gender*                     7 
2-way Interaction: Ethnicity x Gender    2 
Non-significant dof                      26 

*Three objectives showed significant dof main effects for both ethnicity and gender. 

Although significant dof was detected for eight unique objectives at the main 
effects level, these outcomes should be interpreted in light of at least two considerations: 

(a) The statistical significance of main effects could be attributed to increased 
power from the larger sample size (n = 400 examinee responses for each 2-way 
interaction effect, compared to n = 800 examinee responses for each main 
effect). 

(b) Error rates of significance tests increase with repeated significance tests 
performed on the same sample. 
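
Consideration (b) can be made concrete with a small calculation. The sketch below assumes the tests are independent, which repeated tests on the same sample are not, so the result is only a rough guide; the function names are hypothetical, and the Bonferroni adjustment is shown as one common remedy rather than as part of the study's method.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false rejection across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that holds the familywise rate near alpha."""
    return alpha / n_tests


# One test per objective at the 0.01 level, 36 objectives.
rate = familywise_error_rate(0.01, 36)
per_test = bonferroni_alpha(0.01, 36)
```

Under the independence assumption, 36 tests at the 0.01 level give a familywise error rate of about 0.30, far above the nominal per-test level.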

Subsequent discussion of the results and examples of dof are limited to two-way dof 
interactions. 

Dof information and item-level dif data can enhance content-based interpretations. 
Three objectives - Objective 3 with non-significant dof, and Objectives 10 and 14 with 
statistically significant dof - are highlighted to show how dof interactions can be 
interpreted. Two-way dof plots for the three objectives appear as Figures 1-3. Neither 
of the two-way (ethnic x gender) plots for Objective 3 [Figures 1(a) and 1(b)] shows a 
significant interaction effect at the objective level. Inspection of the objective level data 
reveals no apparent "gender gap" or "ethnicity gap." 

[Figures: DOF Plots] 

Objectives 10 and 14 in Figures 2(a), 2(b), 3(a), and 3(b) illustrate two-way 
interactions. The item-level dif data for these objectives are presented in Appendix B. 
Objectives 10 and 14 were flagged with statistically significant (ethnicity x gender) dof 
interaction and appear to be of substantive significance. These dof interactions appear 
in Figures 2(a)-3(b). 

The magnitude of the two-way dof interaction is operationalized by the difference 
between group differences. For Objective 10, that magnitude was 0.02 for the "gender 
gap" and 0.10 for the "ethnic gap." According to these methods, Objective 10's 
group-by-objective interaction is more pronounced for different ethnic groups of the same 
gender. Although Objective 10's two-way plots reveal interaction effects, the magnitudes 
of the interaction do not appear to be significant. 

For Objective 14, the magnitude of the two-way dof interaction was 0.22 with 
"gender gap" interaction between Whites and Blacks [(Black Males - Black Females) vs. 
(White Males - White Females)], and 0.30 with "ethnic gap" interaction between Males 
and Females [(Black Males - White Males) vs. (Black Females - White Females)]. The 
difference in residual mean objective scores between White Males and White Females 
was greater than between Black Males and Black Females: The "gender gap" was more 
pronounced for Whites than for Blacks. The difference in residual mean objective scores 
between Black Males and White Males was greater than the difference between Black 
Females and White Females. The "ethnicity gap" was more pronounced among Males 
than among Females, and more distinct than the "gender gap." 
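
A "difference between group differences" can be sketched as below. Two things are assumptions here and are not taken from the study: the subgroup means are made-up illustrative values, and each gap is measured as an absolute difference. The second choice matters, because with signed gaps the two contrasts, (BM - BF) - (WM - WF) and (BM - WM) - (BF - WF), are algebraically identical and could not yield two different magnitudes.

```python
def gap_interactions(means):
    """Difference between group differences, using absolute within-group gaps.

    `means` maps (ethnicity, gender) -> residual mean objective score.
    The use of absolute gaps is an assumption of this sketch.
    """
    wm, wf = means[("White", "Male")], means[("White", "Female")]
    bm, bf = means[("Black", "Male")], means[("Black", "Female")]

    gender_gap_white = abs(wm - wf)
    gender_gap_black = abs(bm - bf)
    ethnic_gap_male = abs(bm - wm)
    ethnic_gap_female = abs(bf - wf)

    return {
        "gender_gap": abs(gender_gap_white - gender_gap_black),
        "ethnic_gap": abs(ethnic_gap_male - ethnic_gap_female),
    }


# Hypothetical residual means, chosen only to illustrate the arithmetic.
example = gap_interactions({
    ("White", "Male"): -0.30,
    ("White", "Female"): 0.02,
    ("Black", "Male"): 0.05,
    ("Black", "Female"): -0.03,
})
```

In this made-up example the gender gap is larger for Whites (0.32) than for Blacks (0.08), and the ethnic gap is larger among Males (0.35) than among Females (0.05), the same qualitative pattern described for Objective 14.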

To interpret the dof outcomes relative to the items that comprise an objective, 
two-way plots of item-level data are presented for each of Objectives 3, 10, and 14 in Figures 
4-6. As shown in Figures 4(a)-4(d), none of the items (#105-108) associated with 
Objective 3 (round whole numbers) revealed a significant (ethnic x gender) interaction 
effect. For this objective, non-significant item-level dif was consistent with 
non-significant objective-level dof. 

Two-way interaction plots of the four items associated with Objective 10 (identify 
ratios and fractions from pictures) appear as Figures 5(a)-5(d). These plots show 
different patterns of interaction for the four demographic subgroups. Three of the four 
items (#113-115) show statistically non-significant interaction effects for the four 
demographic subgroups and appear to favor Black Females. Item #116 appears to favor 
White Males, while neither favoring nor disfavoring White Females, Black Females, or 
Black Males. The cumulative interaction effect of items #113-115, in addition to the 
interaction effect of item #116, may have resulted in the statistically significant dof 
interaction. 

The item-level dif data associated with Objective 14 (add/subtract decimals to 
numbers of the form .XX), also flagged for a significant (ethnicity x gender) dof 
interaction, appear in Figures 6(a)-6(d) and show consistent interaction patterns between 
the four demographic subgroups. All four items (#65-68) consistently disfavored White 
Males and neither favored nor disfavored White Females, Black Females, or Black Males. 
The item-level and objective-level data are consistent. For this objective, and as 
measured by items #65-68, substantive content-based factors appear to differentiate the 
performance of White Males from other demographic subgroups. 

GROUP PERFORMANCE PROFILES 

Group performance profiles are presented in Figures 7-10. Each of the 36 
objectives in this study was categorized into one of four content domains: 

• conceptual understanding, Objectives 1-11; 
• computational skills, Objectives 12-21; 
• problem solving & application, Objectives 22-31; and 
• measurement, Objectives 32-36, 

partitioned by the vertical dotted lines in each of Figures 7-10. 

Each performance profile displays the 36 residual mean objective scores for a 
particular demographic group. Objectives which tend to disfavor a group are 
characterized by negative residual mean objective scores. Conversely, objectives that 
favor a group are characterized by positive residual mean objective scores. In these 
performance profiles, objective bands were constructed with ±1 standard error around 
the residual mean objective score. Objective bands that included a zero residual mean 
objective score were classified as "0," neither favoring nor disfavoring a group. 
Objective bands located above the zero residual mean were classified as "+," favoring 
the demographic subgroup; objective bands that fell below the zero residual mean were 
classified as "-," disfavoring the group. These performance profiles display the relative 
strengths and weaknesses of a demographic group by content domains and for objectives 
which comprise each of the domains. These outcomes are summarized at the level of 
content domains in Table 2. 
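
The band classification described above can be sketched as follows. The source does not say how the standard error was computed; this sketch assumes the usual estimate, the sample standard deviation divided by the square root of the subgroup size, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def classify_objective(residuals):
    """Classify one objective for one subgroup as "+", "-", or "0".

    The band is the residual mean plus or minus one standard error
    (sample SD / sqrt(n), an assumption of this sketch).
    """
    m = mean(residuals)
    se = stdev(residuals) / math.sqrt(len(residuals))
    lower, upper = m - se, m + se
    if lower > 0:
        return "+"  # whole band above zero: objective favors the group
    if upper < 0:
        return "-"  # whole band below zero: objective disfavors the group
    return "0"      # band includes zero: neither favors nor disfavors
```

Counting the "+", "-", and "0" labels within each content domain for each subgroup gives the kind of tally that a domain-level summary reports.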



Table 2 

Group Objective Performance Summaries by Content Domain 

                               CONTENT DOMAIN 

                CONCEPTUAL      COMPUTATIONAL   PROBLEM SOLVING/ 
                UNDERSTANDING   SKILLS          APPLICATION        MEASUREMENT 
GROUP           +    -    0     +    -    0     +    -    0        +    -    0 

White Male 
Black Male 
White Female 
Black Female 

+ = number of objectives in Content Domain that favor the group 
- = number of objectives in Content Domain that disfavor the group 
0 = number of objectives in Content Domain that neither favor nor disfavor the group 


Performance profiles can uncover content-based information that significance tests 
alone cannot. Objectives which fail to show statistically significant dof are not necessarily 
void of potential bias. An analysis of the group performance profiles shows, for 
example, that Objectives 13 and 15 both disfavored White Males and Black Males, 
favored Black Females, and neither favored nor disfavored White Females. Although 
these outcomes were not statistically significant for dof, the objectives appeared to 
disfavor Males overall as a group. The performance summaries illustrate how dof can 
be used to diagnose performance at the content level. These methods and examples do 
not, however, diminish the necessity for a substantive review of the content. 

SUMMARY AND DISCUSSION 

The concept of dif was extended to the content unit. The interpretation of dof was 
illustrated with examples of statistically significant dof interactions in the context of 
item-level dif data. In the presence of significant dof and consistent patterns of interactions 
at the item- and the objective-levels, dof is attributable to content-based factors. Group 
performance profiles were constructed for each demographic subgroup in the study. 
These profiles identified the relative strengths and weaknesses of objective level 
performance by separate subgroups. Substantive information about potentially biased 
curricular objectives was detected between different group performance profiles. 
Content-based data can be used for diagnostic purposes; they can also augment item- 
level dif data and help contextualize statistical significance. 

According to Bauer (1992), local test development activities continue at a high 
level. A critical step in test development is the identification of potentially biased items 
that favor one group of examinees independent of ability level. As discussed in Skaggs 
and Lissitz's study (1992) of consistency in item bias detection, dif can consistently flag 
items for no apparent reason. Differences in instructional background and opportunity 
to learn can be confounded with differences in matched-ability group performance. 
Based on collateral item information, dof can identify objectives that consistently yield 
aberrant results from expected performance at the objective level across different 
demographic groups. 

Recent surveys of test use (Bauer, 1992; Nolen, Haladyna, & Haas, 1992) reported 
that the majority of local school districts and classroom teachers used tests for diagnostic 
and instructional purposes. If one of the primary purposes of testing is to provide 
information about the success of instructional delivery or to identify curricular areas for 
remediation, test results should also provide diagnostic information to satisfy these goals. 
This diagnostic information must necessarily be content-based. As illustrated in this 
study, dof can be used to create group performance profiles by instructional units to 
target the relative strengths and weaknesses of demographic groups according to tested 
objectives. 

One direction for future research is to explore the effect of multidimensionality 
on the sensitivity of dof. Test items are usually categorized into different content-based 
units with the assumption that each content-based unit is conceptually distinct. 

Another methodological direction for future research is to explore a hierarchical 
structure for dof analysis. The test blueprint has an inherent structure of test items 
within content objectives, nested within content domains. The dependencies between 
and within nested units may be explicitly modeled through hierarchical methods. 

Although data about opportunity to learn and differences in other curricular 
factors were not available for this study, inclusion of those types of data can only lead 
to more comprehensive and informed inferences about curricular outcomes. 